GIS with and

Getting some data

Set-up

Let's import the packages we will use and set the paths for outputs.


In [1]:
# Let's import pandas and some other basic packages we will use 
from __future__ import division
%pylab --no-import-all
%matplotlib inline
import pandas as pd
import numpy as np
import os, sys


Using matplotlib backend: MacOSX
Populating the interactive namespace from numpy and matplotlib

In [2]:
# GIS packages
import geopandas as gpd
from geopandas.tools import overlay
from shapely.geometry import Polygon, Point
import georasters as gr
# Alias for Geopandas
gp = gpd

In [3]:
# Plotting
import matplotlib as mpl
import seaborn as sns
# Setup seaborn
sns.set()

In [4]:
# Paths
pathout = './data/'

if not os.path.exists(pathout):
    os.mkdir(pathout)
    
pathgraphs = './graphs/'
if not os.path.exists(pathgraphs):
    os.mkdir(pathgraphs)

Initial Example -- Natural Earth Country Shapefile

Let's download a shapefile with all the polygons for countries so we can visualize and analyze some of the data we have downloaded in other notebooks. Natural Earth provides lots of free data so let's use that one.

For shapefiles and other polygon type data geopandas is the most useful package. geopandas is to GIS what pandas is to other data. Since gepandas extends the functionality of pandas to a GIS dataset, all the nice functions and properties of pandas are also available in geopandas. Of course, geopandas includes functions and properties unique to GIS data.

Next we will use it to download the shapefile (which is contained in a zip archive). geopandas extends pandas for use with GIS data. We can use many functions and properties of the GeoDataFrame to analyze our data.


In [38]:
countries = gpd.read_file('https://www.naturalearthdata.com/http//www.naturalearthdata.com/download/10m/cultural/ne_10m_admin_0_countries.zip')

Let's look inside this GeoDataFrame


In [6]:
countries


Out[6]:
featurecla scalerank LABELRANK SOVEREIGNT SOV_A3 ADM0_DIF LEVEL TYPE ADMIN ADM0_A3 ... NAME_KO NAME_NL NAME_PL NAME_PT NAME_RU NAME_SV NAME_TR NAME_VI NAME_ZH geometry
0 Admin-0 country 5 2 Indonesia IDN 0 2 Sovereign country Indonesia IDN ... 인도네시아 Indonesië Indonezja Indonésia Индонезия Indonesien Endonezya Indonesia 印度尼西亚 (POLYGON ((117.7036079039552 4.163414542001791...
1 Admin-0 country 5 3 Malaysia MYS 0 2 Sovereign country Malaysia MYS ... 말레이시아 Maleisië Malezja Malásia Малайзия Malaysia Malezya Malaysia 马来西亚 (POLYGON ((117.7036079039552 4.163414542001791...
2 Admin-0 country 6 2 Chile CHL 0 2 Sovereign country Chile CHL ... 칠레 Chili Chile Chile Чили Chile Şili Chile 智利 (POLYGON ((-69.51008875199994 -17.506588197999...
3 Admin-0 country 0 3 Bolivia BOL 0 2 Sovereign country Bolivia BOL ... 볼리비아 Bolivia Boliwia Bolívia Боливия Bolivia Bolivya Bolivia 玻利維亞 POLYGON ((-69.51008875199994 -17.5065881979999...
4 Admin-0 country 0 2 Peru PER 0 2 Sovereign country Peru PER ... 페루 Peru Peru Peru Перу Peru Peru Peru 秘鲁 (POLYGON ((-69.51008875199994 -17.506588197999...
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
250 Admin-0 country 0 4 China CH1 1 2 Country Macao S.A.R MAC ... 마카오 Macau Makau Macau Макао Macao Makao Ma Cao 澳門 (POLYGON ((113.5586043630001 22.16303131700005...
251 Admin-0 country 6 5 Australia AU1 1 2 Dependency Ashmore and Cartier Islands ATC ... 애시모어 카르티에 제도 Ashmore- en Cartiereilanden Wyspy Ashmore i Cartiera Ilhas Ashmore e Cartier Острова Ашмор и Картье Ashmore- och Cartieröarna Ashmore ve Cartier Adaları Quần đảo Ashmore và Cartier 阿什莫尔和卡捷岛 POLYGON ((123.5970158210001 -12.42831796699988...
252 Admin-0 country 6 8 Bajo Nuevo Bank (Petrel Is.) BJN 0 2 Indeterminate Bajo Nuevo Bank (Petrel Is.) BJN ... 바호 누에보 뱅크 Bajo Nuevo Bajo Nuevo Ilha Baixo Novo Бахо-Нуэво Bajo Nuevo Bajo Nuevo Bank Bajo Nuevo Bank 巴霍努埃沃礁 POLYGON ((-79.9892878899999 15.79494863500008,...
253 Admin-0 country 6 5 Serranilla Bank SER 0 2 Indeterminate Serranilla Bank SER ... 세라냐 뱅크 Serranilla Isla Serranilla Ilha Serranilla Серранилья-Банк Serranilla Bank Serranilla Bank Serranilla Bank 塞拉纳浅滩 POLYGON ((-78.63707434799994 15.86208730700008...
254 Admin-0 country 6 6 Scarborough Reef SCR 0 2 Indeterminate Scarborough Reef SCR ... 스카버러 암초 Scarborough-rif Huangyan Dao Recife de Scarborough Скарборо-Шол Scarboroughrevet Scarborough Shoal Bãi cạn Scarborough 黄岩岛 POLYGON ((117.753887178 15.15436926300009, 117...

255 rows × 95 columns

Each row contains the information for one country.

Each column is one property or variable.

Unlike pandas DataFrames, geopandas always must have a geometry column.

Let's plot this data


In [7]:
%matplotlib inline
fig, ax = plt.subplots(figsize=(30,20))
countries.plot(ax=ax)
ax.set_title("WGS84 (lat/lon)", fontdict={'fontsize':34})


Out[7]:
Text(0.5, 1, 'WGS84 (lat/lon)')

We can also get some additional information on this data. For example its projection


In [8]:
countries.crs


Out[8]:
{'init': 'epsg:4326'}

We can reproject the data from its current WGS84 projection to other ones. Let's do this and plot the results so we can see how different projections distort results.


In [9]:
fig, ax = plt.subplots(figsize=(30,20))
countries_merc = countries.to_crs(epsg=3395)
countries_merc.plot(ax=ax)
ax.set_title("Mercator", fontdict={'fontsize':34})


Out[9]:
Text(0.5, 1, 'Mercator')

In [10]:
cea = {'datum': 'WGS84',
 'lat_ts': 0,
 'lon_0': 0,
 'no_defs': True,
 'over': True,
 'proj': 'cea',
 'units': 'm',
 'x_0': 0,
 'y_0': 0}

fig, ax = plt.subplots(figsize=(30,20))
countries_cea = countries.to_crs(crs=cea)
countries_cea.plot(ax=ax)
ax.set_title("Cylindrical Equal Area", fontdict={'fontsize':34})


Out[10]:
Text(0.5, 1, 'Cylindrical Equal Area')

Notice that each projection shows the world in a very different manner, distoring areas, distances etc. So you need to take care when doing computations to use the correct projection. An important issue to remember is that you need a projected (not geographical) projection to compute areas and distances. Let's compare these three a bit. Start with the boundaries of each.


In [11]:
print('[xmin, ymin, xmax, ymax] in three projections')
print(countries.total_bounds)
print(countries_merc.total_bounds)
print(countries_cea.total_bounds)


[xmin, ymin, xmax, ymax] in three projections
[-180.          -90.          180.           83.63410065]
[-20037508.34278923 -34636982.07825699                inf
                inf]
[-20037508.34278923  -6363885.33192604  20037508.34278924
   6324296.52646162]

Let's describe the areas of these countries in the three projections


In [12]:
print('Area distribution in WGS84')
print(countries.area.describe(), '\n')


Area distribution in WGS84
count     255.000000
mean       84.030717
std       446.307748
min         0.000001
25%         0.069381
50%         5.920517
75%        37.990619
max      6049.574693
dtype: float64 


In [13]:
print('Area distribution in Mercator')
print(countries_merc.area.describe(), '\n')


Area distribution in Mercator
count    2.540000e+02
mean     1.274530e+12
std      6.697560e+12
min      2.196509e+04
25%      9.736132e+08
50%      9.232570e+10
75%      5.377488e+11
max      8.288355e+13
dtype: float64 


In [14]:
print('Area distribution in CEA')
print(countries_cea.area.describe(), '\n')


Area distribution in CEA
count    2.550000e+02
mean     5.757889e+11
std      1.836629e+12
min      1.220383e+04
25%      6.820803e+08
50%      5.507806e+10
75%      3.655901e+11
max      1.698019e+13
dtype: float64 

Let's compare the area of each country in the two projected projections


In [15]:
countries_merc = countries_merc.set_index('ADM0_A3')
countries_cea = countries_cea.set_index('ADM0_A3')
countries_merc['ratio_area'] = countries_merc.area / countries_cea.area
countries_cea['ratio_area'] = countries_merc.area / countries_cea.area
sns.set(rc={'figure.figsize':(11.7,8.27)})
sns.set_context("talk")
fig, ax = plt.subplots()
sns.scatterplot(x=countries_cea.area/1e6, y=countries_merc.area/1e6, ax=ax)
sns.lineplot(x=countries_cea.area/1e6, y=countries_cea.area/1e6, color='r', ax=ax)
ax.set_ylabel('Mercator')
ax.set_xlabel('CEA')
ax.set_title("Areas")


Out[15]:
Text(0.5, 1.0, 'Areas')

Now, how do we know what is correct? Let's get some data from WDI to compare the areas of countries in these projections to what the correct area should be (notice that each country usually will use a local projection that ensures areas are correctly computed, so their data should be closer to the truth than any of our global ones).

Here we use some of what we learned before in this notebook.


In [16]:
from pandas_datareader import data, wb
wbcountries = wb.get_countries()
wbcountries['name'] = wbcountries.name.str.strip()
wdi = wb.download(indicator=['AG.LND.TOTL.K2'], country=wbcountries.iso2c.values, start=2017, end=2017)
wdi.columns = ['WDI_area']
wdi = wdi.reset_index()
wdi = wdi.merge(wbcountries[['iso3c', 'iso2c', 'name']], left_on='country', right_on='name')

countries_cea['CEA_area'] = countries_cea.area / 1e6
countries_merc['MERC_area'] = countries_merc.area / 1e6
areas = pd.merge(countries_cea['CEA_area'], countries_merc['MERC_area'], left_index=True, right_index=True)


/Users/ozak/anaconda3/envs/GeoPython36env/lib/python3.6/site-packages/pandas_datareader/wb.py:593: UserWarning: Non-standard ISO country codes: 1A, 1W, 4E, 6D, 6F, 6L, 6N, 6X, 7E, 8S, A4, A5, A9, B1, B2, B3, B4, B6, B7, B8, C4, C5, C6, C7, C8, C9, D2, D3, D4, D5, D6, D7, D8, D9, EU, F1, F6, JG, L4, L5, L6, L7, M1, M2, N6, O6, OE, R6, S1, S2, S3, S4, T2, T3, T4, T5, T6, T7, V1, V2, V3, V4, XC, XD, XE, XF, XG, XH, XI, XJ, XK, XL, XM, XN, XO, XP, XQ, XT, XU, XY, Z4, Z7, ZB, ZF, ZG, ZJ, ZQ, ZT
  "Non-standard ISO " "country codes: %s" % tmp, UserWarning

Let's merge the WDI data with what we have computed before.


In [17]:
wdi = wdi.merge(areas, left_on='iso3c', right_index=True)
wdi


Out[17]:
country year WDI_area iso3c iso2c name CEA_area MERC_area
0 Aruba 2017 180.0 ABW AW Aruba 1.697662e+02 1.780777e+02
1 Afghanistan 2017 652860.0 AFG AF Afghanistan 6.421811e+05 9.306783e+05
2 Angola 2017 1246700.0 AGO AO Angola 1.244652e+06 1.307625e+06
3 Albania 2017 27400.0 ALB AL Albania 2.833579e+04 4.983392e+04
4 Andorra 2017 470.0 AND AD Andorra 4.522394e+02 8.305222e+02
... ... ... ... ... ... ... ... ...
258 Samoa 2017 2830.0 WSM WS Samoa 2.780425e+03 2.945930e+03
260 Yemen, Rep. 2017 527970.0 YEM YE Yemen, Rep. 4.530748e+05 4.899480e+05
261 South Africa 2017 1213090.0 ZAF ZA South Africa 1.219825e+06 1.597733e+06
262 Zambia 2017 743390.0 ZMB ZM Zambia 7.519143e+05 7.960516e+05
263 Zimbabwe 2017 386850.0 ZWE ZW Zimbabwe 3.893382e+05 4.355822e+05

213 rows × 8 columns

How correlated are these measures?


In [18]:
wdi.corr()


Out[18]:
WDI_area CEA_area MERC_area
WDI_area 1.000000 0.997308 0.823693
CEA_area 0.997308 1.000000 0.852408
MERC_area 0.823693 0.852408 1.000000

Let's change the shape of the data so we can plot it using seaborn.


In [19]:
wdi2 = wdi.melt(id_vars=['iso3c', 'iso2c', 'name', 'country', 'year', 'WDI_area'], value_vars=['CEA_area', 'MERC_area'])
wdi2


Out[19]:
iso3c iso2c name country year WDI_area variable value
0 ABW AW Aruba Aruba 2017 180.0 CEA_area 1.697662e+02
1 AFG AF Afghanistan Afghanistan 2017 652860.0 CEA_area 6.421811e+05
2 AGO AO Angola Angola 2017 1246700.0 CEA_area 1.244652e+06
3 ALB AL Albania Albania 2017 27400.0 CEA_area 2.833579e+04
4 AND AD Andorra Andorra 2017 470.0 CEA_area 4.522394e+02
... ... ... ... ... ... ... ... ...
421 WSM WS Samoa Samoa 2017 2830.0 MERC_area 2.945930e+03
422 YEM YE Yemen, Rep. Yemen, Rep. 2017 527970.0 MERC_area 4.899480e+05
423 ZAF ZA South Africa South Africa 2017 1213090.0 MERC_area 1.597733e+06
424 ZMB ZM Zambia Zambia 2017 743390.0 MERC_area 7.960516e+05
425 ZWE ZW Zimbabwe Zimbabwe 2017 386850.0 MERC_area 4.355822e+05

426 rows × 8 columns


In [20]:
sns.set(rc={'figure.figsize':(11.7,8.27)})
sns.set_context("talk")
fig, ax = plt.subplots()
sns.scatterplot(x='WDI_area', y='value', data=wdi2, hue='variable', ax=ax)
#sns.scatterplot(x='WDI_area', y='MERC_area', data=wdi, ax=ax)
sns.lineplot(x='WDI_area', y='WDI_area', data=wdi, color='r', ax=ax)
ax.set_ylabel('Other')
ax.set_xlabel('WDI')
ax.set_title("Areas")
ax.legend()


Out[20]:
<matplotlib.legend.Legend at 0x12b485d30>

We could use other data to compare, e.g. data from the CIA Factbook.


In [21]:
cia_area = pd.read_csv('https://www.cia.gov/library/publications/the-world-factbook/rankorder/rawdata_2147.txt', sep='\t', header=None)
cia_area = pd.DataFrame(cia_area[0].str.strip().str.split('\s\s+').tolist(), columns=['id', 'Name', 'area'])
cia_area.area = cia_area.area.str.replace(',', '').astype(int)
cia_area


Out[21]:
id Name area
0 1 Russia 17098242
1 2 Antarctica 14000000
2 3 Canada 9984670
3 4 United States 9833517
4 5 China 9596960
... ... ... ...
249 250 Spratly Islands 5
250 251 Ashmore and Cartier Islands 5
251 252 Coral Sea Islands 3
252 253 Monaco 2
253 254 Holy See (Vatican City) 0

254 rows × 3 columns


In [22]:
print('CEA area for Russia', countries_cea.area.loc['RUS'] / 1e6)
print('MERC area for Russia', countries_merc.area.loc['RUS'] / 1e6)
print('WDI area for Russia', wdi.loc[wdi.iso3c=='RUS', 'WDI_area'])
print('CIA area for Russia', cia_area.loc[cia_area.Name=='Russia', 'area'])


CEA area for Russia 16980189.48668187
MERC area for Russia 82883546.8527405
WDI area for Russia 200    16376870.0
Name: WDI_area, dtype: float64
CIA area for Russia 0    17098242
Name: area, dtype: int64

Again very similar result. CEA is closest to both WDI and CIA.

Exercise

  1. Merge the CIA data with the wdi data. You need to get correct codes for the countries to allow for the merge or correct the names to ensure they are compatible.
  2. Change the dataframe as we did with wdi2 and plot the association between these measures

Mapping data

Let's use the geoplot package to plot data in a map. As usual we can do it in many ways, but geoplot makes our life very easy. Let's import the various packages we will use.


In [23]:
import geoplot as gplt
import geoplot.crs as gcrs
import mapclassify as mc
import textwrap

Let's import some of the data we had downloaded before. Specifically, let's import the Penn World Tables data.


In [24]:
pwt = pd.read_stata(pathout + 'pwt91.dta')
pwt_xls = pd.read_excel(pathout + 'pwt91.xlsx',encoding='utf-8')
pwt


Out[24]:
countrycode country currency_unit year rgdpe rgdpo pop emp avh hc ... csh_x csh_m csh_r pl_c pl_i pl_g pl_x pl_m pl_n pl_k
0 ABW Aruba Aruban Guilder 1950 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 ABW Aruba Aruban Guilder 1951 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 ABW Aruba Aruban Guilder 1952 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 ABW Aruba Aruban Guilder 1953 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 ABW Aruba Aruban Guilder 1954 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
12371 ZWE Zimbabwe US Dollar 2013 28086.937500 28329.810547 15.054506 7.914061 NaN 2.504635 ... 0.169638 -0.426188 0.090225 0.577488 0.582022 0.448409 0.723247 0.632360 0.383488 0.704313
12372 ZWE Zimbabwe US Dollar 2014 29217.554688 29355.759766 15.411675 8.222112 NaN 2.550258 ... 0.141791 -0.340442 0.051500 0.600760 0.557172 0.392895 0.724510 0.628352 0.349735 0.704991
12373 ZWE Zimbabwe US Dollar 2015 30091.923828 29150.750000 15.777451 8.530669 NaN 2.584653 ... 0.137558 -0.354298 -0.023353 0.622927 0.580814 0.343926 0.654940 0.564430 0.348472 0.713156
12374 ZWE Zimbabwe US Dollar 2016 30974.292969 29420.449219 16.150362 8.839398 NaN 2.616257 ... 0.141248 -0.310446 0.003050 0.640176 0.599462 0.337853 0.657060 0.550084 0.346553 0.718671
12375 ZWE Zimbabwe US Dollar 2017 32693.474609 30940.816406 16.529903 9.181251 NaN 2.648248 ... 0.141799 -0.299539 0.019133 0.647136 0.726222 0.340680 0.645338 0.539529 0.412392 0.755215

12376 rows × 52 columns

Let's recreate GDPpc data


In [25]:
# Get columns with GDP measures
gdpcols = pwt_xls.loc[pwt_xls['Variable definition'].apply(lambda x: str(x).upper().find('REAL GDP')!=-1), 'Variable name'].tolist()

# Generate GDPpc for each measure
for gdp in gdpcols:
    pwt[gdp + '_pc'] = pwt[gdp] / pwt['pop']

# GDPpc data
gdppccols = [col+'_pc' for col in gdpcols]
pwt[['countrycode', 'country', 'year'] + gdppccols]


Out[25]:
countrycode country year rgdpe_pc rgdpo_pc cgdpe_pc cgdpo_pc rgdpna_pc
0 ABW Aruba 1950 NaN NaN NaN NaN NaN
1 ABW Aruba 1951 NaN NaN NaN NaN NaN
2 ABW Aruba 1952 NaN NaN NaN NaN NaN
3 ABW Aruba 1953 NaN NaN NaN NaN NaN
4 ABW Aruba 1954 NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ...
12371 ZWE Zimbabwe 2013 1865.683105 1881.816040 1874.657715 1898.868286 1952.479736
12372 ZWE Zimbabwe 2014 1895.806519 1904.774048 1918.362305 1935.120605 1947.798950
12373 ZWE Zimbabwe 2015 1907.274170 1847.621094 1924.819824 1902.378662 1934.789307
12374 ZWE Zimbabwe 2016 1917.869873 1821.658813 1932.771973 1889.612061 1901.752686
12375 ZWE Zimbabwe 2017 1977.838257 1871.808716 1998.100098 1940.005371 1913.949829

12376 rows × 8 columns

Let's map GDPpc for the year 2010 using geoplot. For this, let's write two functions that will simplify plotting and saving maps. Also, we can reuse it whenever we need to create a new map for the world.


In [26]:
# Functions for plotting
def center_wrap(text, cwidth=32, **kw):
    '''Center Text (to be used in legend)'''
    lines = text
    #lines = textwrap.wrap(text, **kw)
    return "\n".join(line.center(cwidth) for line in lines)

def MyChloropleth(mydf=pwt.loc[pwt.year==2010], myfile='GDPpc2010', myvar='rgdpe_pc',
                  mylegend='GDP per capita 2010',
                  k=5,
                  extent=[-180, -90, 180, 90],
                  bbox_to_anchor=(0.2, 0.5),
                  edgecolor='white', facecolor='lightgray',
                  scheme='FisherJenks',
                  save=True,
                  percent=False,
                  **kwargs):
    # Chloropleth
    # Color scheme
    if scheme=='EqualInterval':
        scheme = mc.EqualInterval(mydf[myvar], k=k)
    elif scheme=='Quantiles':
        scheme = mc.Quantiles(mydf[myvar], k=k)
    elif scheme=='BoxPlot':
        scheme = mc.BoxPlot(mydf[myvar], k=k)
    elif scheme=='FisherJenks':
        scheme = mc.FisherJenks(mydf[myvar], k=k)
    elif scheme=='FisherJenksSampled':
        scheme = mc.FisherJenksSampled(mydf[myvar], k=k)
    elif scheme=='HeadTailBreaks':
        scheme = mc.HeadTailBreaks(mydf[myvar], k=k)
    elif scheme=='JenksCaspall':
        scheme = mc.JenksCaspall(mydf[myvar], k=k)
    elif scheme=='JenksCaspallForced':
        scheme = mc.JenksCaspallForced(mydf[myvar], k=k)
    elif scheme=='JenksCaspallSampled':
        scheme = mc.JenksCaspallSampled(mydf[myvar], k=k)
    elif scheme=='KClassifiers':
        scheme = mc.KClassifiers(mydf[myvar], k=k)
    # Format legend
    upper_bounds = scheme.bins
    # get and format all bounds
    bounds = []
    for index, upper_bound in enumerate(upper_bounds):
        if index == 0:
            lower_bound = mydf[myvar].min()
        else:
            lower_bound = upper_bounds[index-1]
        # format the numerical legend here
        if percent:
            bound = f'{lower_bound:.0%} - {upper_bound:.0%}'
        else:
            bound = f'{float(lower_bound):,.0f} - {float(upper_bound):,.0f}'
        bounds.append(bound)
    legend_labels = bounds
    #Plot
    ax = gplt.choropleth(
        mydf, hue=myvar, projection=gcrs.PlateCarree(central_longitude=0.0, globe=None),
        edgecolor='white', linewidth=1,
        cmap='Reds', legend=True,
        scheme=scheme,
        legend_kwargs={'bbox_to_anchor': bbox_to_anchor,
                       'frameon': True,
                       'title':mylegend,
                       },
        legend_labels = legend_labels,
        figsize=(24, 16),
        rasterized=True,
    )
    gplt.polyplot(
        countries, projection=gcrs.PlateCarree(central_longitude=0.0, globe=None),
        edgecolor=edgecolor, facecolor=facecolor,
        ax=ax,
        rasterized=True,
        extent=extent,
    )
    if save:
        plt.savefig(pathgraphs + myfile + '_' + myvar +'.pdf', dpi=300, bbox_inches='tight')
        plt.savefig(pathgraphs + myfile + '_' + myvar +'.png', dpi=300, bbox_inches='tight')
    pass

Let's merge the PWT GDPpc data with our shape file.


In [27]:
year = 2010
gdppc = pwt.loc[pwt.year==year].reset_index(drop=True).copy()
gdppc = countries.merge(gdppc, left_on='ADM0_A3', right_on='countrycode')
gdppc = gdppc.dropna(subset=['rgdpe_pc'])
mylegend = center_wrap(["GDP per capita in " + str(year)], cwidth=32, width=32)
MyChloropleth(mydf=gdppc, myfile='PWT_GDP_' + str(year), myvar='rgdpe_pc', mylegend=mylegend, k=10, scheme='Quantiles', save=True)



In [28]:
year = 2000
gdppc = pwt.loc[pwt.year==year].reset_index(drop=True).copy()
gdppc = countries.merge(gdppc, left_on='ADM0_A3', right_on='countrycode')
gdppc = gdppc.dropna(subset=['rgdpe_pc'])
mylegend = center_wrap(["GDP per capita in " + str(year)], cwidth=32, width=32)
MyChloropleth(mydf=gdppc, myfile='PWT_GDP_' + str(year), myvar='rgdpe_pc', mylegend=mylegend, k=10, scheme='Quantiles', save=True)


GIS operations, functions and properties

Let's explore the data with some of the functions of geopandas.

Let's start by finding the centroid of every country and plot it.


In [29]:
centroids = countries.copy()
centroids.geometry = centroids.centroid
ax = gplt.pointplot(
    centroids, projection=gcrs.PlateCarree(central_longitude=0.0, globe=None),
    figsize=(24, 16),
    rasterized=True,
)
gplt.polyplot(countries.geometry, projection=gcrs.PlateCarree(central_longitude=0.0, globe=None),
              edgecolor='white', facecolor='lightgray',
              extent=[-180, -90, 180, 90],
              ax=ax)


Out[29]:
<cartopy.mpl.geoaxes.GeoAxesSubplot at 0x12d476f28>

Let's compute distances between the centroids. For this we will use the geopy package.


In [30]:
from geopy.distance import geodesic, great_circle
import itertools
centroids['xy'] = centroids.geometry.apply(lambda x: [x.y, x.x])

In [31]:
mypairs = pd.DataFrame(index = pd.MultiIndex.from_arrays(
                    np.array([x for x in itertools.product(centroids['ADM0_A3'].tolist(), repeat=2)]).T,
                    names = ['country_1','country_2'])).reset_index()
mypairs = mypairs.merge(centroids[['ADM0_A3', 'xy']], left_on='country_1', right_on='ADM0_A3')
mypairs = mypairs.merge(centroids[['ADM0_A3', 'xy']], left_on='country_2', right_on='ADM0_A3', suffixes=['_1', '_2'])
mypairs


Out[31]:
country_1 country_2 ADM0_A3_1 xy_1 ADM0_A3_2 xy_2
0 IDN IDN IDN [-2.222961002517392, 117.27043333916548] IDN [-2.222961002517392, 117.27043333916548]
1 MYS IDN MYS [3.792392850953034, 109.69886844216687] IDN [-2.222961002517392, 117.27043333916548]
2 CHL IDN CHL [-37.74360561760632, -71.36437463800704] IDN [-2.222961002517392, 117.27043333916548]
3 BOL IDN BOL [-16.7068768105592, -64.68475372880839] IDN [-2.222961002517392, 117.27043333916548]
4 PER IDN PER [-9.15438848075216, -74.37806457210715] IDN [-2.222961002517392, 117.27043333916548]
... ... ... ... ... ... ...
65020 MAC SCR MAC [22.157785198349384, 113.55019686490212] SCR [15.152112822000067, 117.75381196333339]
65021 ATC SCR ATC [-12.432577176848286, 123.58636778644266] SCR [15.152112822000067, 117.75381196333339]
65022 BJN SCR BJN [15.795009963377407, -79.9878658593175] SCR [15.152112822000067, 117.75381196333339]
65023 SER SCR SER [15.864460896333414, -78.63811872766658] SCR [15.152112822000067, 117.75381196333339]
65024 SCR SCR SCR [15.152112822000067, 117.75381196333339] SCR [15.152112822000067, 117.75381196333339]

65025 rows × 6 columns


In [32]:
mypairs['geodesic_dist'] = mypairs.apply(lambda x: geodesic(x.xy_1, x.xy_2).km, axis=1)
mypairs['great_circle_dist'] = mypairs.apply(lambda x: great_circle(x.xy_1, x.xy_2).km, axis=1)
mypairs


Out[32]:
country_1 country_2 ADM0_A3_1 xy_1 ADM0_A3_2 xy_2 geodesic_dist great_circle_dist
0 IDN IDN IDN [-2.222961002517392, 117.27043333916548] IDN [-2.222961002517392, 117.27043333916548] 0.000000 0.000000
1 MYS IDN MYS [3.792392850953034, 109.69886844216687] IDN [-2.222961002517392, 117.27043333916548] 1073.341454 1074.915491
2 CHL IDN CHL [-37.74360561760632, -71.36437463800704] IDN [-2.222961002517392, 117.27043333916548] 15491.447980 15482.921855
3 BOL IDN BOL [-16.7068768105592, -64.68475372880839] IDN [-2.222961002517392, 117.27043333916548] 17899.591177 17899.294953
4 PER IDN PER [-9.15438848075216, -74.37806457210715] IDN [-2.222961002517392, 117.27043333916548] 18217.220296 18207.778329
... ... ... ... ... ... ... ... ...
65020 MAC SCR MAC [22.157785198349384, 113.55019686490212] SCR [15.152112822000067, 117.75381196333339] 893.135033 895.894345
65021 ATC SCR ATC [-12.432577176848286, 123.58636778644266] SCR [15.152112822000067, 117.75381196333339] 3117.742839 3133.757085
65022 BJN SCR BJN [15.795009963377407, -79.9878658593175] SCR [15.152112822000067, 117.75381196333339] 16072.183081 16060.602969
65023 SER SCR SER [15.864460896333414, -78.63811872766658] SCR [15.152112822000067, 117.75381196333339] 16135.829664 16124.702173
65024 SCR SCR SCR [15.152112822000067, 117.75381196333339] SCR [15.152112822000067, 117.75381196333339] 0.000000 0.000000

65025 rows × 8 columns


In [33]:
mypairs.corr()


Out[33]:
geodesic_dist great_circle_dist
geodesic_dist 1.000000 0.999997
great_circle_dist 0.999997 1.000000

Let's now use the cylindrical equal area projection and geopandas distance function to compute the distance between centroids.


In [39]:
centroids_cea = countries_cea.copy()
centroids_cea.reset_index(inplace=True)
centroids_cea.geometry = centroids_cea.centroid
centroids_cea['xy'] = centroids_cea.geometry.apply(lambda x: [x.y, x.x])
mypairs_cea = pd.DataFrame(index = pd.MultiIndex.from_arrays(
                    np.array([x for x in itertools.product(centroids_cea['ADM0_A3'].tolist(), repeat=2)]).T,
                    names = ['country_1','country_2'])).reset_index()
mypairs_cea = mypairs_cea.merge(centroids_cea[['ADM0_A3', 'geometry', 'xy']], left_on='country_1', right_on='ADM0_A3')
mypairs_cea = mypairs_cea.merge(centroids_cea[['ADM0_A3', 'geometry', 'xy']], left_on='country_2', right_on='ADM0_A3', suffixes=['_1', '_2'])

In [35]:
mypairs_cea['CEA_dist'] = mypairs_cea.apply(lambda x: x.geometry_1.distance(x.geometry_2)/1e3, axis=1)
mypairs_cea


Out[35]:
country_1 country_2 ADM0_A3_1 geometry_1 xy_1 ADM0_A3_2 geometry_2 xy_2 CEA_dist
0 IDN IDN IDN POINT (13053566.27147314 -244402.4845064451) [-244402.48450644506, 13053566.27147314] IDN POINT (13053566.27147314 -244402.4845064451) [-244402.48450644506, 13053566.27147314] 0.000000
1 MYS IDN MYS POINT (12211550.16769677 418637.6421584018) [418637.64215840184, 12211550.167696772] IDN POINT (13053566.27147314 -244402.4845064451) [-244402.48450644506, 13053566.27147314] 1071.733796
2 CHL IDN CHL POINT (-7927268.764294639 -3670204.350505374) [-3670204.3505053744, -7927268.764294639] IDN POINT (13053566.27147314 -244402.4845064451) [-244402.48450644506, 13053566.27147314] 21258.681926
3 BOL IDN BOL POINT (-7201342.480506786 -1814919.514580463) [-1814919.5145804633, -7201342.480506786] IDN POINT (13053566.27147314 -244402.4845064451) [-244402.48450644506, 13053566.27147314] 20315.704573
4 PER IDN PER POINT (-8281824.470839065 -999164.6380899205) [-999164.6380899205, -8281824.4708390655] IDN POINT (13053566.27147314 -244402.4845064451) [-244402.48450644506, 13053566.27147314] 21348.736825
... ... ... ... ... ... ... ... ... ...
65020 MAC SCR MAC POINT (12640350.29916989 2390982.020614455) [2390982.0206144545, 12640350.29916989] SCR POINT (13108294.38672515 1656478.33538606) [1656478.3353860602, 13108294.386725154] 870.900300
65021 ATC SCR ATC POINT (13757571.52962934 -1364242.800738117) [-1364242.8007381172, 13757571.529629342] SCR POINT (13108294.38672515 1656478.33538606) [1656478.3353860602, 13108294.386725154] 3089.711474
65022 BJN SCR BJN POINT (-8904208.497110264 1725054.531312234) [1725054.5313122338, -8904208.497110264] SCR POINT (13108294.38672515 1656478.33538606) [1656478.3353860602, 13108294.386725154] 22012.609702
65023 SER SCR SER POINT (-8753955.333704833 1732450.149887001) [1732450.149887001, -8753955.333704833] SCR POINT (13108294.38672515 1656478.33538606) [1656478.3353860602, 13108294.386725154] 21862.381722
65024 SCR SCR SCR POINT (13108294.38672515 1656478.33538606) [1656478.3353860602, 13108294.386725154] SCR POINT (13108294.38672515 1656478.33538606) [1656478.3353860602, 13108294.386725154] 0.000000

65025 rows × 9 columns

Let's merge the three distance measures and see how similar they are.


In [36]:
dists = mypairs[['country_1', 'country_2', 'geodesic_dist', 'great_circle_dist']].copy()
dists = dists.merge(mypairs_cea[['country_1', 'country_2', 'CEA_dist']])
dists


Out[36]:
country_1 country_2 geodesic_dist great_circle_dist CEA_dist
0 IDN IDN 0.000000 0.000000 0.000000
1 MYS IDN 1073.341454 1074.915491 1071.733796
2 CHL IDN 15491.447980 15482.921855 21258.681926
3 BOL IDN 17899.591177 17899.294953 20315.704573
4 PER IDN 18217.220296 18207.778329 21348.736825
... ... ... ... ... ...
65020 MAC SCR 893.135033 895.894345 870.900300
65021 ATC SCR 3117.742839 3133.757085 3089.711474
65022 BJN SCR 16072.183081 16060.602969 22012.609702
65023 SER SCR 16135.829664 16124.702173 21862.381722
65024 SCR SCR 0.000000 0.000000 0.000000

65025 rows × 5 columns


In [37]:
dists.corr()


Out[37]:
geodesic_dist great_circle_dist CEA_dist
geodesic_dist 1.000000 0.999997 0.854771
great_circle_dist 0.999997 1.000000 0.854477
CEA_dist 0.854771 0.854477 1.000000